Abstract

We consider the unconstrained optimization problem whose objective function is composed of a
smooth and a non-smooth component, where the smooth component is the expectation of a random
function. This type of problem arises in some interesting applications in machine learning. We
propose a stochastic gradient descent algorithm for this class of optimization problems. When the
non-smooth component has a particular structure, we propose another stochastic gradient descent
algorithm by incorporating a smoothing method into our first algorithm. We prove the convergence
rates of these two algorithms and show their numerical performance by applying them to
regularized linear regression problems with different sets of synthetic data.



1 Introduction 

In the past decade, convex programming has been widely applied in a variety of areas including 
statistical estimation, machine learning, data mining and signal processing. One of the most popular 
classes of convex programming problems, which appears in many different applications such as 
lasso [22] and group lasso [27], can be formulated as the following minimization problem,

$$\min_x \; \phi(x) = f(x) + h(x). \tag{1}$$

Here, the function $f(x)$ is smooth and convex and its gradient $\nabla f(x)$ is Lipschitz continuous
with a Lipschitz constant $L$. The function $h(x)$ is assumed to be convex but non-smooth.

Interior-point methods [2, 14] are considered general-purpose algorithms for solving different types
of convex programs. However, they do not scale even to problems of moderate size because of the
high cost of solving the Newton linear system in each main iteration. A block coordinate
method was developed by Tseng and Yun [26] and applied to the problems which can be formulated 
by (1) in [15, 5]. However, this method requires a separable structure in the objective function that 
does not exist in some applications such as overlapped group lasso [10]. 

Recently, gradient descent methods, or so-called first-order methods, e.g. [18, 25, 1], have attracted
great interest because they are not only relatively easy to implement but also capable of solving
some challenging problems of huge size. For problems formulated by (1), the first-order methods
proposed in [18, 25, 1] can achieve an $O(1/N^2)$ convergence rate, where $N$ is the number of
iterations. Different variations of the gradient descent algorithm have been successfully applied to
different types of problems, for example, nuclear norm regularization [20, 24], $\ell_1/\ell_2$-norm
regularization [13] and so on.

In each loop of a gradient descent algorithm, a projection mapping, which itself is a minimization
problem, must be solved in order to find the next intermediate solution. Although a projection
mapping usually has a closed form solution which guarantees the efficiency of a gradient descent
algorithm, there exists a class of problems formulated by (1), including overlapped group lasso [10]
and fused lasso [23], for which a closed form solution for the projection mapping is not available.
Fortunately, the non-smooth component $h(x)$ in the objective function usually has a particular type
of structure of the form

$$h(x) = \max_{v \in Q} v^T A x, \tag{2}$$

where $Q$ is a compact set. Nesterov [19] proposed a scheme to construct a smooth approximation of
the objective function $\phi(x)$ in (1) whenever $h(x)$ satisfies (2), and also a gradient descent
algorithm to minimize the approximated problem. This approximation scheme and algorithm have been
applied to overlapped group lasso [4] and fused lasso [3] and have given good numerical results.

A stochastic gradient descent algorithm can be considered as a gradient descent algorithm that
utilizes random approximations of gradients instead of exact gradients. During the past few years, a
significant amount of work has been done to develop stochastic gradient descent algorithms for
different problems (see, e.g. [16, 11, 6, 8]). One reason to consider stochastic gradients is that
the exact gradients are computationally expensive or sometimes even impossible to evaluate. One
typical case is stochastic optimization, where the objective function $f(x)$ is given as an
expectation of a random value function $F(x, \xi)$, i.e. $f(x) = EF(x, \xi)$, where $\xi$ is a random
variable and the expectation is taken over $\xi$. In this situation, a multidimensional numerical
integral would be needed to compute the exact gradient $\nabla f(x) = E\nabla F(x, \xi)$. That would be
too time consuming, especially when the dimension is high. It could be even worse if the distribution
of $\xi$ is unknown, so that there is no way to get the exact gradient $\nabla f(x)$.

In this paper, we consider the optimization problem formulated by (1) but we further assume the
smooth component of the objective function to be of the form $f(x) = EF(x, \xi)$, just as in a
stochastic optimization problem. Hence, we have to consider the stochastic gradient for
the reasons mentioned above. We propose a stochastic gradient descent algorithm to solve (1)
under the assumption that a stochastic approximation of $\nabla f(x)$, denoted by $G(x, \xi)$, is
available in each iteration, where $\xi$ is a random variable. We also assume $G(x, \xi)$ is an
unbiased estimate of $\nabla f(x)$, i.e., $EG(x, \xi) = \nabla f(x)$, and
$E\|\nabla f(x) - G(x, \xi)\|^2 \le \sigma^2$ for some nonnegative constant $\sigma$.¹ We show that our
stochastic gradient algorithm obtains a convergence rate of $O(1/\sqrt{N})$, which is the same, up to
a constant independent of $N$, as the convergence rates shown in [16, 11, 6, 8] without assuming
strong convexity of the objective functions.

Our algorithms can be viewed as an extension of Algorithm 1 in [25] that utilizes stochastic
gradients. The choices of the parameters $\theta_t$ and $\gamma_t$ in our algorithms are inspired by
the choices of similar parameters in [11]. But our algorithm is different from Lan's accelerated
stochastic approximation method in [11] in that the latter assumes $G(x, \xi)$ is a stochastic
subgradient for the non-smooth objective function $\phi(x)$ in (1), while we assume $G(x, \xi)$ is a
stochastic gradient only for the smooth component $f(x)$ in (1). Although our method is similar to a
simplified version of the AC-SA algorithm proposed in [6], we use a different choice of parameters
and focus more on the effect of the smoothing technique in this paper.

Similar to the exact gradient descent algorithm, the existing stochastic gradient descent algorithms,
e.g. [6, 8], have to solve a projection mapping in each iteration, which may not have a closed
form solution. However, when the function $h(x)$ in (1) has the structure given in (2), we propose
another stochastic gradient descent algorithm by incorporating the smoothing technique proposed
by Nesterov [19]. This method replaces $h(x)$ by its smooth approximation such that the projection
mapping always has a closed form solution. Hence, our method can be applied to problems like
overlapped group lasso and fused lasso, which other stochastic gradient descent algorithms cannot
solve efficiently.

According to [19, 12], the convergence rates of the accelerated gradient algorithms will be reduced
from $O(1/N^2)$ to $O(1/N)$ if the smoothing technique in [19] is applied. However, we show that the
convergence rate of our stochastic gradient algorithm remains $O(1/\sqrt{N})$ even when the smoothing
technique is applied. In other words, although the smoothing technique is costly for deterministic
gradient methods, it comes essentially for free for stochastic gradient methods.

The rest of this paper is organized as follows: in the next section, we present our stochastic gradient
descent algorithm and prove its convergence rate. Combining our first algorithm with a smoothing
technique, we propose another stochastic gradient descent algorithm in Section 3. In Section 4, we
show the numerical results on simulated data, followed by a concluding section at the end.

¹ In this paper, the notation $\|\cdot\|$ without any subscript denotes the Euclidean norm of a vector.



2 Stochastic Gradient Descent Algorithm 

In this section, we propose a stochastic gradient algorithm, summarized in Algorithm 1 below, to
solve the following optimization problem:

$$\min_x \; \phi(x) = f(x) + h(x) = EF(x, \xi) + h(x), \tag{3}$$

where the expectation $E$ is taken over the random variable $\xi$. We assume that at every point $x$,
there is a random vector $G(x, \xi)$ determined by $x$ and $\xi$ such that $EG(x, \xi) = \nabla f(x)$ and
$E\|\nabla f(x) - G(x, \xi)\|^2 \le \sigma^2$ for any $x$. In the $t$-th step of Algorithm 1, we
independently generate a random variable $\xi_t$ from the distribution of $\xi$ and compute
$G(y_t, \xi_t)$ at a point $y_t$ based on $\xi_t$.

Algorithm 1 Stochastic Gradient Descent Algorithm (SG) 

Input: The total number of iterations $N$ and the Lipschitz constant $L$.

Initialization: Choose $\theta_t = \frac{2}{t+2}$ and $\gamma_t = \frac{2}{t+2}\left(\frac{N^{3/2}}{L} + 2\right)$. Set $x_0 = 0$, $z_0 = 0$ and $t = 0$.
Iterate for $t = 0, 1, 2, \ldots, N$:

1. $y_t = (1 - \theta_t) x_t + \theta_t z_t$

2. Generate a random vector $\xi_t$ independently from the distribution of $\xi$

3. $z_{t+1} = \arg\min_x \{\langle x, G(y_t, \xi_t)\rangle + \frac{\gamma_t L}{2}\|x - z_t\|^2 + h(x)\}$

4. $x_{t+1} = (1 - \theta_t) x_t + \theta_t z_{t+1}$
Output: $x_{N+1}$
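
The iteration of Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: the step sizes are read as $\theta_t = 2/(t+2)$ and $\gamma_t = \theta_t (N^{3/2}/L + 2)$, and the names `sg`, `grad_oracle` and `prox` are hypothetical. Step 3 is delegated to `prox(v, step)`, assumed to return $\arg\min_x \{h(x) + \frac{1}{2\,\mathrm{step}}\|x - v\|^2\}$, which is what step 3 reduces to after completing the square.

```python
import random


def sg(grad_oracle, prox, dim, n_iters, lipschitz, seed=0):
    """Sketch of Algorithm 1 (SG) on plain Python lists.

    grad_oracle(y, rng) -> stochastic gradient G(y, xi) as a list (hypothetical name).
    prox(v, step)       -> argmin_x { h(x) + ||x - v||^2 / (2 * step) } (hypothetical name).
    """
    rng = random.Random(seed)
    x = [0.0] * dim                      # x_0 = 0
    z = [0.0] * dim                      # z_0 = 0
    scale = n_iters ** 1.5 / lipschitz + 2.0
    for t in range(n_iters + 1):         # t = 0, 1, ..., N
        theta = 2.0 / (t + 2)
        gamma = theta * scale            # assumed form of gamma_t
        # Step 1: y_t = (1 - theta_t) x_t + theta_t z_t
        y = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        # Step 2: draw xi_t and evaluate G(y_t, xi_t)
        g = grad_oracle(y, rng)
        # Step 3: minimizing <x, g> + (gamma L / 2) ||x - z||^2 + h(x) is a
        # proximal step of length 1 / (gamma L) taken from z - g / (gamma L)
        step = 1.0 / (gamma * lipschitz)
        z_next = prox([zi - step * gi for zi, gi in zip(z, g)], step)
        # Step 4: x_{t+1} = (1 - theta_t) x_t + theta_t z_{t+1}
        x = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z_next)]
        z = z_next
    return x                             # x_{N+1}
```

For instance, with $h \equiv 0$ the callable `prox` is the identity, and `sg(lambda y, rng: [y[0] - 1.0], lambda v, s: v, 1, 400, 1.0)` runs the method on $f(x) = \frac{1}{2}(x - 1)^2$ with an exact gradient.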

Remark 1. Algorithm 1 is based on Algorithm 1 proposed by Tseng in [25]. It differs from Tseng's
algorithm in two aspects. First, the exact gradient used in Tseng's algorithm is replaced by the
stochastic gradient due to the difficulty of computing the exact gradient in our problems, as
mentioned in Section 1. Second, two sequences of step lengths, $\theta_t$ and $\gamma_t$, are
maintained to guarantee the convergence of our algorithm, while in Tseng's algorithm one sequence
of step lengths $\{\theta_t\}$ is enough. It should be pointed out that, with a particular choice of
$\gamma_t$ defined through a parameter $\gamma^*$ (see [6] for the explicit expressions), Algorithm 1
becomes the AC-SA algorithm proposed by Ghadimi and Lan [6] for unconstrained optimization problems
when $f(x)$ is convex but not necessarily strongly convex. However, the parameter $\gamma^*$ in the
AC-SA algorithm is hard to evaluate because it depends on the optimal solution $x^*$ and $\sigma$,
while the parameters in our algorithm are relatively simple and result in better numerical
performance, as shown in Section 4.

Here, we assume the projection mapping in step 3 of Algorithm 1 can be solved efficiently or has
a closed form solution. This is true in many problems where the non-smooth term $h(x)$ is the
$\ell_1$-norm [22], the $\ell_1/\ell_2$-norm [13, 27, 9] or the nuclear norm of $x$ [20, 24].
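
As a concrete instance of such a closed form, the $\ell_1$ case is solved by the well-known soft-thresholding operator. The sketch below is illustrative only; the function name `soft_threshold` is an assumption, and the mapping to step 3 (threshold $\lambda/(\gamma_t L)$ applied at the point $z_t - G(y_t, \xi_t)/(\gamma_t L)$ when $h(x) = \lambda\|x\|_1$) follows from completing the square rather than from a formula stated in the text.

```python
def soft_threshold(v, tau):
    """Closed-form solution of argmin_x { tau * ||x||_1 + 0.5 * ||x - v||^2 }.

    Each coordinate is shrunk toward zero by tau and set to zero when |v_i| <= tau.
    """
    return [max(abs(vi) - tau, 0.0) * (1.0 if vi > 0 else -1.0) for vi in v]
```

With $h(x) = \lambda\|x\|_1$, step 3 would then read `soft_threshold(v, lam / (gamma * L))` where `v` is $z_t - G(y_t, \xi_t)/(\gamma_t L)$.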

Using the same notation as in Algorithm 1, we present the convergence rate of Algorithm 1 in the
following theorem. Some techniques in the proof are inspired by the proofs of the complexity
results in [25] and [11].

Theorem 1. Suppose $N$ is the total number of iterations in Algorithm 1, $x^*$ is the optimal
solution of (3), and the stochastic gradient $G(x, \xi)$ satisfies
$E\|\nabla f(x) - G(x, \xi)\|^2 \le \sigma^2$ for all $x$. Then we have

$$E(\phi(x_{N+1})) - \phi(x^*) \le \frac{2D^2 + \sigma^2}{(N+2)^{1/2}} + \frac{(4D^2 + 2\sigma^2)L}{(N+2)^2},$$

where $D = \|x^* - z_0\|$.

Three technical lemmas are presented here before the proof of the convergence rate of Algorithm
1 is given. Lemma 1 is an inequality satisfied by the step lengths we choose in Algorithm 1 and
Lemma 2 is a basic property of convex functions. Lemma 3, shown in [11], is a technical result used
to characterize the optimal solution of the projection mapping in step 3 of Algorithm 1. We include
the proof of Lemma 3 here for completeness.

Lemma 1. Suppose the sequences $\{\theta_t\}$ and $\{\gamma_t\}$ are chosen as in Algorithm 1. Then we
have $\frac{1 - \theta_{t+1}}{\theta_{t+1}\gamma_{t+1}} \le \frac{1}{\theta_t\gamma_t}$ and
$\gamma_t > \theta_t$.

Proof. The first inequality comes from

$$\frac{(1 - \theta_{t+1})\theta_t}{\theta_{t+1}} = \frac{\theta_t}{\theta_{t+1}} - \theta_t = \frac{t+1}{t+2} \le \frac{t+2}{t+3} = \frac{\gamma_{t+1}}{\gamma_t}.$$

The second one is obvious. □
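
Lemma 1 can also be checked numerically in exact arithmetic. The sketch below assumes the step sizes $\theta_t = 2/(t+2)$ and $\gamma_t = \theta_t \cdot s$, where the hypothetical parameter `scale` stands for the constant $s = N^{3/2}/L + 2$; any $s > 1$ satisfies both inequalities.

```python
from fractions import Fraction


def theta(t):
    """theta_t = 2 / (t + 2), kept as an exact rational."""
    return Fraction(2, t + 2)


def gamma(t, scale):
    """gamma_t = theta_t * scale, with scale standing for N^{3/2}/L + 2."""
    return theta(t) * scale


def lemma1_holds(t, scale):
    """Check both claims of Lemma 1 at index t."""
    lhs = (1 - theta(t + 1)) / (theta(t + 1) * gamma(t + 1, scale))
    rhs = 1 / (theta(t) * gamma(t, scale))
    return lhs <= rhs and gamma(t, scale) > theta(t)
```

Calling `all(lemma1_holds(t, Fraction(7, 2)) for t in range(200))` exercises the first 200 indices for one choice of the constant.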

Lemma 2. If $f(x)$ is a smooth convex function on $\mathbb{R}^n$ and $\nabla f(x)$ is Lipschitz
continuous with a Lipschitz constant $L$, then we have

$$f(y) + \langle x - y, \nabla f(y)\rangle \le f(x) \le f(y) + \langle x - y, \nabla f(y)\rangle + \frac{L}{2}\|x - y\|^2$$

for all $x$, $y$.

This is a classical property of convex functions. For a proof, see [7]. 

Lemma 3. (See also [25], [11] and [6]) Suppose $\psi(x)$ is convex and $z^*$ is the optimal solution of
$\min_{\hat z} \psi(\hat z) + \frac{1}{2}\|\hat z - z\|^2$. Then we have the following inequality:

$$\psi(z^*) + \frac{1}{2}\|z^* - z\|^2 \le \psi(x) + \frac{1}{2}\|x - z\|^2 - \frac{1}{2}\|x - z^*\|^2$$

for all $x$.

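Proof. The definition of $z^*$ implies that there exists a subgradient $\eta$ in $\partial\psi(z^*)$,
the subdifferential of the function $\psi(z)$ at $z^*$, such that

$$\langle \eta + z^* - z, x - z^*\rangle \ge 0 \quad \text{for all } x. \tag{4}$$

The convexity of $\psi(x)$ implies

$$\psi(x) \ge \psi(z^*) + \langle \eta, x - z^*\rangle \quad \text{for all } x. \tag{5}$$

It is easy to verify that

$$\frac{1}{2}\|z - x\|^2 = \frac{1}{2}\|z - z^*\|^2 + \langle z^* - z, x - z^*\rangle + \frac{1}{2}\|z^* - x\|^2 \quad \text{for all } x. \tag{6}$$

Using (4), (5) and (6) above, we conclude that

$$\begin{aligned}
\psi(x) + \frac{1}{2}\|x - z\|^2 &= \psi(x) + \frac{1}{2}\|z - z^*\|^2 + \langle z^* - z, x - z^*\rangle + \frac{1}{2}\|z^* - x\|^2 \\
&\ge \psi(z^*) + \frac{1}{2}\|z - z^*\|^2 + \langle \eta + z^* - z, x - z^*\rangle + \frac{1}{2}\|z^* - x\|^2 \\
&\ge \psi(z^*) + \frac{1}{2}\|z - z^*\|^2 + \frac{1}{2}\|z^* - x\|^2 \quad \text{for all } x.
\end{aligned}$$

□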



Here, we give the proof of Theorem 1.
Proof. We define $\Delta_t = \nabla f(y_t) - G(y_t, \xi_t)$ so that $E\Delta_t = 0$ and
$E\|\Delta_t\|^2 \le \sigma^2$. We can bound $\phi(x_{t+1})$ from above as follows.

$$\begin{aligned}
\phi(x_{t+1}) &= f(x_{t+1}) + h(x_{t+1}) \\
&\le f(y_t) + \langle x_{t+1} - y_t, \nabla f(y_t)\rangle + \frac{L}{2}\|x_{t+1} - y_t\|^2 + h(x_{t+1}) \\
&\le (1 - \theta_t)\big(f(y_t) + \langle x_t - y_t, \nabla f(y_t)\rangle + h(x_t)\big) \\
&\quad + \theta_t\big(f(y_t) + \langle z_{t+1} - y_t, \nabla f(y_t)\rangle + h(z_{t+1})\big) + \theta_t^2\frac{L}{2}\|z_{t+1} - z_t\|^2 \\
&\le (1 - \theta_t)\phi(x_t) + \theta_t\big(f(y_t) + \langle z_{t+1} - y_t, \nabla f(y_t)\rangle + h(z_{t+1})\big) + \theta_t^2\frac{L}{2}\|z_{t+1} - z_t\|^2 \\
&= (1 - \theta_t)\phi(x_t) + \theta_t\Big(f(y_t) + \langle z_{t+1} - y_t, G(y_t, \xi_t)\rangle + h(z_{t+1}) + \gamma_t\frac{L}{2}\|z_{t+1} - z_t\|^2\Big) \\
&\quad + (\theta_t^2 - \theta_t\gamma_t)\frac{L}{2}\|z_{t+1} - z_t\|^2 + \theta_t\langle z_{t+1} - y_t, \Delta_t\rangle.
\end{aligned} \tag{7}$$

The first and third inequalities above are due to Lemma 2 and the second one is implied by the
updating equations for $y_t$ and $x_{t+1}$ and the convexity of $h(x)$.

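According to Lemma 3 with $\psi(z) = \frac{1}{\gamma_t L}\big(\langle z, G(y_t, \xi_t)\rangle + h(z)\big)$,
$z^* = z_{t+1}$ and $z = z_t$, we get

$$\begin{aligned}
&\big(\langle z_{t+1}, G(y_t, \xi_t)\rangle + h(z_{t+1})\big) + \frac{\gamma_t L}{2}\|z_{t+1} - z_t\|^2 \\
&\quad \le \big(\langle x, G(y_t, \xi_t)\rangle + h(x)\big) + \frac{\gamma_t L}{2}\|x - z_t\|^2 - \frac{\gamma_t L}{2}\|x - z_{t+1}\|^2 \quad \text{for all } x.
\end{aligned} \tag{8}$$

By choosing $x = x^*$ in (8), it follows from (7) and (8) that

$$\begin{aligned}
\phi(x_{t+1}) &\le (1 - \theta_t)\phi(x_t) + \theta_t\Big(f(y_t) + \langle x^* - y_t, G(y_t, \xi_t)\rangle + h(x^*) + \gamma_t\frac{L}{2}\|x^* - z_t\|^2\Big) \\
&\quad - \theta_t\gamma_t\frac{L}{2}\|x^* - z_{t+1}\|^2 + (\theta_t^2 - \theta_t\gamma_t)\frac{L}{2}\|z_{t+1} - z_t\|^2 + \theta_t\langle z_{t+1} - y_t, \Delta_t\rangle \\
&= (1 - \theta_t)\phi(x_t) + \theta_t\Big(f(y_t) + \langle x^* - y_t, \nabla f(y_t)\rangle + h(x^*) + \gamma_t\frac{L}{2}\|x^* - z_t\|^2\Big) \\
&\quad - \theta_t\gamma_t\frac{L}{2}\|x^* - z_{t+1}\|^2 + (\theta_t^2 - \theta_t\gamma_t)\frac{L}{2}\|z_{t+1} - z_t\|^2 + \theta_t\langle z_{t+1} - x^*, \Delta_t\rangle.
\end{aligned} \tag{9}$$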

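Here, the equality above holds because $\Delta_t = \nabla f(y_t) - G(y_t, \xi_t)$.
By Lemma 2, the term $f(y_t) + \langle x^* - y_t, \nabla f(y_t)\rangle + h(x^*)$ in (9) is no more than
$\phi(x^*)$. Hence, we can upper bound $\phi(x_{t+1})$ as:

$$\begin{aligned}
\phi(x_{t+1}) &\le (1 - \theta_t)\phi(x_t) + \theta_t\phi(x^*) + \theta_t\gamma_t\frac{L}{2}\|x^* - z_t\|^2 - \theta_t\gamma_t\frac{L}{2}\|x^* - z_{t+1}\|^2 \\
&\quad - (\theta_t\gamma_t - \theta_t^2)\frac{L}{2}\|z_{t+1} - z_t\|^2 + \theta_t\langle z_{t+1} - y_t, \Delta_t\rangle - \theta_t\langle x^* - y_t, \Delta_t\rangle \\
&= (1 - \theta_t)\phi(x_t) + \theta_t\phi(x^*) + \theta_t\gamma_t\frac{L}{2}\|x^* - z_t\|^2 - \theta_t\gamma_t\frac{L}{2}\|x^* - z_{t+1}\|^2 \\
&\quad - (\theta_t\gamma_t - \theta_t^2)\frac{L}{2}\|z_{t+1} - z_t\|^2 + \theta_t\langle z_{t+1} - z_t, \Delta_t\rangle + \theta_t\langle z_t - x^*, \Delta_t\rangle \\
&\le (1 - \theta_t)\phi(x_t) + \theta_t\phi(x^*) + \theta_t\gamma_t\frac{L}{2}\|x^* - z_t\|^2 - \theta_t\gamma_t\frac{L}{2}\|x^* - z_{t+1}\|^2 \\
&\quad - (\theta_t\gamma_t - \theta_t^2)\frac{L}{2}\|z_{t+1} - z_t\|^2 + \theta_t\|z_{t+1} - z_t\|\|\Delta_t\| + \theta_t\langle z_t - x^*, \Delta_t\rangle \\
&\le (1 - \theta_t)\phi(x_t) + \theta_t\phi(x^*) + \theta_t\gamma_t\frac{L}{2}\|x^* - z_t\|^2 - \theta_t\gamma_t\frac{L}{2}\|x^* - z_{t+1}\|^2 \\
&\quad + \frac{\theta_t\|\Delta_t\|^2}{2L(\gamma_t - \theta_t)} + \theta_t\langle z_t - x^*, \Delta_t\rangle.
\end{aligned}$$

We get the second inequality above by applying the Cauchy-Schwarz inequality to
$\langle z_{t+1} - z_t, \Delta_t\rangle$, and the last inequality comes from applying the inequality
$-ax^2 + bx \le \frac{b^2}{4a}$ with $a > 0$ to $a = \frac{L}{2}(\theta_t\gamma_t - \theta_t^2)$,
$x = \|z_{t+1} - z_t\|$ and $b = \theta_t\|\Delta_t\|$. Note that $(\theta_t\gamma_t - \theta_t^2) > 0$
from Lemma 1.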

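So far, we have obtained

$$\begin{aligned}
\phi(x_{t+1}) &\le (1 - \theta_t)\phi(x_t) + \theta_t\phi(x^*) + \theta_t\gamma_t\frac{L}{2}\|x^* - z_t\|^2 - \theta_t\gamma_t\frac{L}{2}\|x^* - z_{t+1}\|^2 \\
&\quad + \frac{\theta_t\|\Delta_t\|^2}{2L(\gamma_t - \theta_t)} + \theta_t\langle z_t - x^*, \Delta_t\rangle.
\end{aligned} \tag{10}$$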

We define $E(\langle z_t - x^*, \Delta_t\rangle \mid \xi_1, \ldots, \xi_{t-1})$ to be the conditional
expectation of $\langle z_t - x^*, \Delta_t\rangle$ under the condition that $\xi_1, \ldots, \xi_{t-1}$
have been generated. According to Algorithm 1, $z_t$ is determined only by $\xi_1, \ldots, \xi_{t-1}$
but not by $\xi_t$. Hence, $E(\langle z_t - x^*, \Delta_t\rangle \mid \xi_1, \ldots, \xi_{t-1}) = 0$
because $E\Delta_t = 0$. By the iterative property of expectation, we have

$$E\langle z_t - x^*, \Delta_t\rangle = E\big(E(\langle z_t - x^*, \Delta_t\rangle \mid \xi_1, \ldots, \xi_{t-1})\big) = E0 = 0.$$

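Hence, if we subtract $\phi(x^*)$ from both sides of inequality (10) and take the expectation, we
have

$$\begin{aligned}
E(\phi(x_{t+1}) - \phi(x^*)) &\le (1 - \theta_t)\big(E(\phi(x_t)) - \phi(x^*)\big) + \theta_t\gamma_t\frac{L}{2}E\|x^* - z_t\|^2 - \theta_t\gamma_t\frac{L}{2}E\|x^* - z_{t+1}\|^2 \\
&\quad + \frac{\theta_t\sigma^2}{2L(\gamma_t - \theta_t)}.
\end{aligned} \tag{11}$$

Moreover, we divide both sides of inequality (11) by $\theta_t\gamma_t$ and get

$$\begin{aligned}
\frac{1}{\theta_t\gamma_t}\big(E(\phi(x_{t+1})) - \phi(x^*)\big) &\le \frac{1 - \theta_t}{\theta_t\gamma_t}\big(E(\phi(x_t)) - \phi(x^*)\big) + \frac{L}{2}E\|x^* - z_t\|^2 - \frac{L}{2}E\|x^* - z_{t+1}\|^2 + \frac{\sigma^2}{2L\gamma_t(\gamma_t - \theta_t)} \\
&\le \frac{1}{\theta_{t-1}\gamma_{t-1}}\big(E(\phi(x_t)) - \phi(x^*)\big) + \frac{L}{2}E\|x^* - z_t\|^2 - \frac{L}{2}E\|x^* - z_{t+1}\|^2 + \frac{\sigma^2}{2L\gamma_t(\gamma_t - \theta_t)},
\end{aligned} \tag{12}$$

where the second inequality comes from Lemma 1.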
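By applying inequality (12) recursively, we obtain

$$\frac{1}{\theta_N\gamma_N}\big(E(\phi(x_{N+1})) - \phi(x^*)\big) \le \frac{L}{2}\|x^* - z_0\|^2 + \frac{\sigma^2}{2L}\sum_{t=0}^{N}\frac{1}{\gamma_t(\gamma_t - \theta_t)}. \tag{13}$$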

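The definitions of $\theta_t$ and $\gamma_t$ imply

$$\sum_{t=0}^{N}\frac{1}{\gamma_t(\gamma_t - \theta_t)} = \sum_{t=0}^{N}\frac{(t+2)^2}{4(N^{3/2}/L + 2)(N^{3/2}/L + 1)} \le \frac{(N+2)(N+3)(2N+5)L^2}{24N^3} \le \frac{12N^3L^2}{24N^3} = \frac{L^2}{2},$$

where the last inequality holds for $N \ge 4$,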



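which, together with inequality (13), implies:

$$\begin{aligned}
E(\phi(x_{N+1})) - \phi(x^*) &\le \theta_N\gamma_N\left(\frac{LD^2}{2} + \frac{\sigma^2L}{4}\right) \\
&\le \frac{4N^{3/2}}{(N+2)^2L}\left(\frac{L}{2}D^2 + \frac{\sigma^2L}{4}\right) + \frac{8}{(N+2)^2}\left(\frac{L}{2}D^2 + \frac{\sigma^2L}{4}\right) \\
&\le \frac{2D^2 + \sigma^2}{(N+2)^{1/2}} + \frac{(4D^2 + 2\sigma^2)L}{(N+2)^2}.
\end{aligned}$$

□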



Remark 2. Algorithm 1 obtains an asymptotic rate of convergence
$E(\phi(x_{N+1})) - \phi(x^*) = O(1/\sqrt{N})$, which is the same as the convergence rate of the AC-SA
algorithm proposed by Ghadimi and Lan [6] up to a constant factor. This convergence rate is also
known to be asymptotically optimal (see [17]) in terms of the number of iterations $N$.

3 Smoothing Stochastic Gradient Descent Algorithm 

Notice that a projection mapping

$$z_{t+1} = \arg\min_x \{\langle x, G(y_t, \xi_t)\rangle + \frac{\gamma_t L}{2}\|x - z_t\|^2 + h(x)\} \tag{14}$$

must be solved in step 3 of Algorithm 1. Similar projection mappings also appear in other
gradient or stochastic gradient algorithms. As indicated in Section 1, (14) does not necessarily have
a closed form solution. This happens, in particular, in the group lasso problem with overlapped group
structures [10] and in fused lasso [23]. In this case, another iterative algorithm has to be designed
for solving this projection mapping in each iteration of Algorithm 1, which could make Algorithm 1
very slow in practical applications.

In order to adapt Algorithm 1 to problems whose corresponding projection mappings have no
closed form, we utilize the smoothing technique proposed by Nesterov [19] to construct a smooth
approximation for problem (3) before we apply Algorithm 1.

Suppose the non-smooth part $h(x)$ in (3) can be represented as

$$h(x) = \max_{v \in Q} v^T A x.$$

We consider the function

$$h_\mu(x) = \max_{v \in Q}\{v^T A x - \mu d(v)\}. \tag{15}$$

Here, the parameter $\mu$ is a positive constant and $d(v)$ is a smooth and strongly convex function on
$Q$. According to [19], the function $h_\mu(x)$ is a smooth lower approximation for $h(x)$ if $\mu$ is
positive. In fact, it can be shown that

$$h_\mu(x) \le h(x) \le h_\mu(x) + \mu M \quad \text{for all } x,$$

where $M = \max_{v \in Q} d(v)$. Hence, the parameter $\mu$ controls the accuracy of the approximation.

We denote by $v_\mu(x)$ the optimal solution of the maximization problem involved in (15). Since $d(v)$
is strongly convex, $v_\mu(x)$ is well-defined because of the uniqueness of the optimal solution. It is
proved in [19] that $h_\mu(x)$ is a smooth function whose gradient is

$$\nabla h_\mu(x) = A^T v_\mu(x). \tag{16}$$

Therefore, the function $\phi_\mu(x) = f(x) + h_\mu(x)$ serves as a smooth lower approximation for
$\phi(x)$ in problem (3) and we have

$$\phi_\mu(x) \le \phi(x) \le \phi_\mu(x) + \mu M \quad \text{for all } x. \tag{17}$$

By (16), the gradient of $\phi_\mu(x)$ is

$$\nabla\phi_\mu(x) = \nabla f(x) + A^T v_\mu(x), \tag{18}$$

and $G(x, \xi) + A^T v_\mu(x)$ provides its stochastic approximation. It is easy to see that
$E(G(x, \xi) + A^T v_\mu(x)) = \nabla\phi_\mu(x)$ and
$E\|\nabla\phi_\mu(x) - G(x, \xi) - A^T v_\mu(x)\|^2 \le \sigma^2$ for any $x$.

It is also shown in [19] that the gradient $\nabla\phi_\mu(x)$ is Lipschitz continuous with a Lipschitz
constant

$$L_\mu = L + \frac{1}{\mu c}\|A\|^2, \tag{19}$$

where $\|A\| = \max_{\|x\| = 1, \|y\| = 1} y^T A x$ and $c > 0$ is the strong convexity parameter of the
function $d(v)$.

Since $\phi_\mu(x)$ is a smooth function with a stochastic gradient $G(x, \xi) + A^T v_\mu(x)$ at each
$x$, we can apply Algorithm 1 to minimize $\phi_\mu(x)$. When the smoothing parameter $\mu$ is small
enough, the solution we get will also be a good approximate solution for (3). This modified algorithm
is presented as Algorithm 2 below.



Algorithm 2 Smoothing Stochastic Gradient Descent Algorithm (SSG) 

Input: The total number of iterations $N$, the Lipschitz constant $L$ for $f(x)$ and the smoothing
parameter $\mu$.

Initialization: Compute the Lipschitz constant $L_\mu$ by (19). Choose $\theta_t = \frac{2}{t+2}$ and $\gamma_t = \frac{2}{t+2}\left(\frac{N^{3/2}}{L_\mu} + 2\right)$. Set $x_0 = 0$, $z_0 = 0$ and $t = 0$.
Iterate for $t = 0, 1, 2, \ldots, N$:

1. $y_t = (1 - \theta_t) x_t + \theta_t z_t$

2. $v_\mu(y_t) = \arg\max_{v \in Q}\{v^T A y_t - \mu d(v)\}$

3. Generate a random vector $\xi_t$ independently from the distribution of $\xi$

4. $z_{t+1} = \arg\min_x \{\langle x, G(y_t, \xi_t) + A^T v_\mu(y_t)\rangle + \frac{\gamma_t L_\mu}{2}\|x - z_t\|^2\}$

5. $x_{t+1} = (1 - \theta_t) x_t + \theta_t z_{t+1}$
Output: $x_{N+1}$
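
A minimal Python sketch of Algorithm 2 is given below, under the assumption that the step sizes are $\theta_t = 2/(t+2)$ and $\gamma_t = \theta_t (N^{3/2}/L_\mu + 2)$ and that the quadratic coefficient in step 4 is $\gamma_t L_\mu / 2$. The names `ssg`, `grad_oracle` and `smooth_grad` are hypothetical; `smooth_grad(y)` is assumed to return $A^T v_\mu(y)$, so the caller carries out step 2.

```python
import random


def ssg(grad_oracle, smooth_grad, dim, n_iters, lipschitz_mu, seed=0):
    """Sketch of Algorithm 2 (SSG) on plain Python lists.

    grad_oracle(y, rng) -> stochastic gradient G(y, xi) of f (hypothetical name).
    smooth_grad(y)      -> A^T v_mu(y), the gradient of h_mu (hypothetical name).
    lipschitz_mu        -> L_mu = L + ||A||^2 / (mu * c), as in (19).
    """
    rng = random.Random(seed)
    x = [0.0] * dim
    z = [0.0] * dim
    scale = n_iters ** 1.5 / lipschitz_mu + 2.0
    for t in range(n_iters + 1):
        theta = 2.0 / (t + 2)
        gamma = theta * scale
        # Step 1
        y = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z)]
        # Step 2 (done inside smooth_grad) and step 3
        s = smooth_grad(y)
        g = grad_oracle(y, rng)
        # Step 4: unconstrained quadratic, solved by one explicit gradient step
        step = 1.0 / (gamma * lipschitz_mu)
        z_next = [zi - step * (gi + si) for zi, gi, si in zip(z, g, s)]
        # Step 5
        x = [(1 - theta) * xi + theta * zi for xi, zi in zip(x, z_next)]
        z = z_next
    return x
```

No proximal subproblem appears in the loop: the only per-iteration work beyond the two gradient evaluations is a vector update.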



Remark 3. Similar to step 3 in Algorithm 1, Algorithm 2 also has to solve a projection mapping
in step 4. However, since $\phi_\mu(x)$ does not contain a non-smooth term like $h(x)$ in (3), the
projection mapping in step 4 is simply an unconstrained quadratic program whose optimal solution has
a closed form.
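
For completeness, the closed form can be written out explicitly. The derivation below assumes that the quadratic term in step 4 carries the coefficient $\gamma_t L_\mu / 2$, by analogy with (14); it simply sets the gradient of the objective of step 4 to zero.

```latex
% Objective of step 4 (assumed coefficient \gamma_t L_\mu / 2):
%   q(x) = <x, G(y_t, \xi_t) + A^T v_\mu(y_t)> + (\gamma_t L_\mu / 2) ||x - z_t||^2
\nabla q(x) = G(y_t, \xi_t) + A^T v_\mu(y_t) + \gamma_t L_\mu (x - z_t) = 0
\quad\Longrightarrow\quad
z_{t+1} = z_t - \frac{1}{\gamma_t L_\mu}\bigl(G(y_t, \xi_t) + A^T v_\mu(y_t)\bigr).
```

In other words, under this reading step 4 is a plain stochastic gradient step on $\phi_\mu$ with step length $1/(\gamma_t L_\mu)$.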

Since Algorithm 2 just solves an approximation of (3), we have to make the smoothing parameter $\mu$
small enough in order to make the solution returned by Algorithm 2 a near-optimal one for (3).
However, according to (19) and Theorem 1, decreasing the smoothing parameter $\mu$ will increase the
Lipschitz constant $L_\mu$, and more iterations will be needed in Algorithm 2 in order to minimize
$\phi_\mu(x)$.

Fortunately, by Theorem 1, the Lipschitz constant only appears in the $O(1/N^2)$ component of
the convergence rate, which is dominated by the $O(1/N^{1/2})$ component. This means that, as long
as $\mu = O(1/N^\gamma)$ with $\gamma \le \frac{3}{2}$, which implies $L_\mu = O(N^\gamma)$ with
$\gamma \le \frac{3}{2}$, the convergence rate of Algorithm 2 is still $O(1/N^{1/2})$. Based on this
observation, we prove the following convergence result for Algorithm 2 with such a choice of $\mu$.
Similar results can be found in [19] and [12].
Theorem 2. If we set /i = (j]^) in Algorithm 2, then after N iterations, we will have:

* ^ 2£> 2 + ct 2 4D 2 + 2a 2 \\A\\ AD 2 + 2ct 2 

Proof. Because $\phi(x_{N+1}) - \phi_\mu(x_{N+1}) \le \mu M$ and $\phi_\mu(x^*) - \phi(x^*) \le 0$, we have

$$\begin{aligned}
E\phi(x_{N+1}) - \phi(x^*) &= E\phi(x_{N+1}) - E\phi_\mu(x_{N+1}) + E\phi_\mu(x_{N+1}) - \phi_\mu(x^*) + \phi_\mu(x^*) - \phi(x^*) \\
&\le \mu M + \frac{2D^2 + \sigma^2}{(N+2)^{1/2}} + \left(L + \frac{1}{c\mu}\|A\|^2\right)\frac{4D^2 + 2\sigma^2}{(N+2)^2},
\end{aligned} \tag{20}$$

where the last inequality is by Theorem 1 and (19). Setting fj, = J^ 2 ) ' we ^ ave 

, „ 2£ 2 + (7 2 4D 2 + 2(7 2 || A|| , 4£ 2 + 2<7 2 



□ 



Remark 4. This theorem shows a difference between exact gradient descent and stochastic gradient
descent algorithms when the smoothing technique is applied. The gradient descent algorithm proposed
in [19] obtains a convergence rate of $O(1/N^2)$, but the rate is reduced to $O(1/N)$ after applying
the smoothing technique. However, for Algorithm 2, the smoothing technique only slows down a
non-dominating component in the convergence rate, such that Algorithm 2 still obtains a convergence
rate of $O(1/N^{1/2})$, which is the same as that of Algorithm 1. In other words, the price paid for
incorporating a smoothing technique is negligible.

Suppose the smooth component $f(x)$ in the objective function is not just convex but also strongly
convex. Then the stochastic gradient algorithms developed in [6] and [8] can achieve a convergence
rate of $O(1/N)$. As in the merely convex case, this convergence rate consists of two components: one
term of $O(1/N^2)$, which is not dominating but contains the Lipschitz constant $L$, and one term of
$O(1/N)$, which is the bottleneck but is independent of $L$. Hence, for the same reasons as above, if
we incorporate the smoothing technique into the algorithms in [6] and [8] for strongly convex
objective functions, just as we did in Algorithm 2, we can obtain similar smoothing stochastic
gradient algorithms with a convergence rate of $O(1/N)$.

4 Numerical Results 

In this section, we apply our algorithms to four different types of regularized regression problems
which belong to the class of problems formulated by (3). We compare our numerical results with
the AC-SA algorithm proposed by Ghadimi and Lan in [6]. We used a Matlab implementation and
ran the experiments on a computer with an Intel(R) Core(TM)2 Duo CPU T8300 2.40GHz processor
and 2.00GB RAM.

4.1 Regularized Linear Regression with Discrete Probability Distribution 

Suppose there are $K$ data points $\{(x_i, y_i)\}_{i=1}^K$ with $x_i \in \mathbb{R}^p$ and
$y_i \in \mathbb{R}$. The task of linear regression is to find the parameters $\beta \in \mathbb{R}^p$
to fit the linear model $y = \beta^T x + \epsilon$ by minimizing the average square loss function

$$f_1(\beta) = \frac{1}{2K}\sum_{i=1}^{K}(x_i^T\beta - y_i)^2 = \frac{1}{2K}\|X\beta - y\|^2,$$

where $X = [x_1, \ldots, x_K]^T$ and $y = [y_1, \ldots, y_K]^T$. Here, we assume each instance
$(x_i, y_i)$ occurs with equal chance, i.e., with probability $\frac{1}{K}$, so that $f_1(\beta)$ is
essentially $\frac{1}{2}$ times the expectation of the square loss $(x^T\beta - y)^2$.

The gradient of $f_1(\beta)$ is

$$\nabla f_1(\beta) = \frac{1}{K}X^T(X\beta - y).$$

It is easy to prove that $\nabla f_1(\beta)$ is Lipschitz continuous with a Lipschitz constant
$L = \frac{1}{K}\lambda_{\max}(X^T X)$, where $\lambda_{\max}(X^T X)$ denotes the largest eigenvalue of
the matrix $X^T X$.

Since we are testing stochastic gradient descent algorithms, we have to generate the stochastic
gradient for $f_1(\beta)$ in each iteration. We first randomly sample a subset
$\{(x_i, y_i)\}_{i \in S}$ with $S \subset \{1, 2, \ldots, K\}$ from the whole data set, and the
stochastic gradient $G_1(\beta, S)$ corresponding to this sample is

$$G_1(\beta, S) = \frac{1}{|S|}X_S^T(X_S\beta - y_S), \tag{21}$$

where $X_S$ and $y_S$ are the sub-matrices of $X$ and $y$ whose rows are indexed by the elements of $S$.
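
The mini-batch gradient (21) can be sketched as follows. The helper names `minibatch_gradient`, `full_gradient` and `sample_indices` are hypothetical, indices are 0-based, and the uniform sampling without replacement in `sample_indices` is an assumption, since the text only says the subset is sampled randomly.

```python
import random


def minibatch_gradient(X, y, beta, sample):
    """G_1(beta, S) = X_S^T (X_S beta - y_S) / |S| for the row indices in `sample`."""
    p = len(beta)
    grad = [0.0] * p
    for i in sample:
        residual = sum(X[i][j] * beta[j] for j in range(p)) - y[i]
        for j in range(p):
            grad[j] += X[i][j] * residual / len(sample)
    return grad


def full_gradient(X, y, beta):
    """Exact gradient of f_1: the mini-batch formula with S = all rows."""
    return minibatch_gradient(X, y, beta, range(len(X)))


def sample_indices(K, size, rng):
    """Draw |S| = size distinct row indices uniformly at random (an assumption)."""
    return rng.sample(range(K), size)
```

Averaging `minibatch_gradient` over all single-row samples reproduces `full_gradient`, which is the unbiasedness property $EG_1(\beta, S) = \nabla f_1(\beta)$ for $|S| = 1$.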

When the data points belong to a high dimensional space, we are interested in selecting a small
number of input features which contribute most to the output. Hence, we want
to minimize $f_1(\beta)$ with a regularization term $\Omega(\beta)$ which forces a highly sparse
$\beta$ with zeros in the components corresponding to the less relevant input features. The
regularized linear regression problem is then defined as

$$\min_\beta \; f_1(\beta) + \lambda\Omega(\beta), \tag{22}$$

where $\lambda$ is the parameter that controls the regularization level.

In our numerical experiments, we consider two different choices of $\Omega(\beta)$. One choice is
simply the $\ell_1$-norm of $\beta$, i.e.,

$$\Omega_1(\beta) = \|\beta\|_1. \tag{23}$$

A linear regression problem regularized by $\Omega_1(\beta)$ is also known as a lasso problem [22].

We apply Algorithm 1 (SG) and Algorithm 2 (SSG) proposed in this paper and also the AC-SA
algorithm proposed in [6] to problem (22) with $\Omega(\beta) = \Omega_1(\beta)$. In this case, the
projection mappings in both SG and AC-SA have a closed form solution (see [13]). In order to apply
SSG, we observe that the non-smooth term in the objective function can be represented as
$\lambda\Omega_1(\beta) = \max_{\|\alpha\|_\infty \le 1}\alpha^T A\beta$,
where $A = \lambda I$, and we choose $d(\alpha) = \frac{1}{2}\|\alpha\|^2$ as the strongly convex function in SSG.
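
For this choice of $A$ and $d(\alpha)$, the maximization defining the smoothed function separates over coordinates and the maximizer is a clipped rescaling of $\beta$. The sketch below works this out; the function names `v_mu`, `h_mu` and `h` are hypothetical, and the closed form is derived here from the definitions rather than quoted from the text.

```python
def v_mu(beta, lam, mu):
    """Maximizer of alpha^T (lam * beta) - mu * 0.5 * ||alpha||^2 over ||alpha||_inf <= 1.

    Coordinate-wise, the unconstrained maximizer lam * beta_i / mu is clipped to [-1, 1].
    """
    return [max(-1.0, min(1.0, lam * b / mu)) for b in beta]


def h_mu(beta, lam, mu):
    """Smoothed l1 term: value of the maximization problem at its maximizer."""
    alpha = v_mu(beta, lam, mu)
    return sum(lam * a * b - 0.5 * mu * a * a for a, b in zip(alpha, beta))


def h(beta, lam):
    """Original non-smooth term lam * ||beta||_1."""
    return lam * sum(abs(b) for b in beta)
```

Here $M = \max_{\|\alpha\|_\infty \le 1}\frac{1}{2}\|\alpha\|^2 = p/2$, so the sandwich $h_\mu \le h \le h_\mu + \mu p/2$ can be verified directly, and the gradient of the smoothed term is $A^T v_\mu(\beta) = \lambda\, v_\mu(\beta)$.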

We randomly generate a dataset $\{(x_i, y_i)\}_{i=1}^K$ as follows. First of all, we choose the true
parameter $\beta \in \mathbb{R}^p$ to be $[1, 1, \ldots, 1, 0, 0, \ldots, 0]^T$ with the first $p/2$
components equal to 1 and the last $p/2$ components equal to 0. Then we generate each data point
$x_i \in \mathbb{R}^p$ by generating each of its components $x_{ij}$ from a standard normal
distribution $N(0, 1)$ independently, and we generate $y_i$ by setting
$y_i = \beta^T x_i + \epsilon_i/10$ with $\epsilon_i$ generated from a standard normal distribution $N(0, 1)$.
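
The data generation described above can be sketched with the Python standard library. The function name `make_dataset` and the `seed` argument are hypothetical, and the noise term is read as $\epsilon_i/10$ with $\epsilon_i \sim N(0, 1)$.

```python
import random


def make_dataset(K, p, seed=0):
    """Generate K points x_i in R^p with y_i = beta^T x_i + eps_i / 10.

    beta has its first p/2 entries equal to 1 and the remaining entries equal to 0;
    all x_ij and eps_i are independent standard normal draws.
    """
    rng = random.Random(seed)
    beta_true = [1.0] * (p // 2) + [0.0] * (p - p // 2)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(K)]
    y = [sum(b * xij for b, xij in zip(beta_true, row)) + rng.gauss(0.0, 1.0) / 10.0
         for row in X]
    return X, y, beta_true
```

For example, `make_dataset(1000, 20)` produces a data set with the dimensions of the smaller experiment.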

We generate a set of data as above with $K = 1000$ and $p = 20$, and we set the parameter
$\lambda = 0.1$ and the total number of iterations $N = 50000$. In each iteration, we randomly sample
10 data points, i.e. $|S| = 10$, to generate the stochastic gradient $G_1(\beta, S)$ by (21). The
numerical performances of these three algorithms are shown in the left panel of Figure 1. The
horizontal axis represents the CPU running time and the vertical axis represents the value of the
objective function.

Similarly, we apply these three algorithms to problem (22) with $\Omega(\beta) = \Omega_1(\beta)$ on a
larger dataset with $K = 100000$ and $p = 200$. We still set $\lambda = 0.1$ and $N = 50000$ but we
increase the sample size $|S|$ to 100. The decrease of the objective values over time for these three
algorithms is shown in the right panel of Figure 1.

The other choice for $\Omega(\beta)$ is the overlapped group sparsity inducing norm introduced by
Jenatton et al. [10]. Suppose the set of groups of inputs
$\mathcal{G} = \{g_1, \ldots, g_{|\mathcal{G}|}\}$ is a subset of the power set of
$\{1, 2, \ldots, p\}$. The overlapped group sparsity inducing norm $\Omega_2(\beta)$ is defined as

$$\Omega_2(\beta) = \sum_{g \in \mathcal{G}} w_g\|\beta_g\|, \tag{24}$$

where $\beta_g \in \mathbb{R}^{|g|}$ is the sub-vector of $\beta$ which only contains the components of
$\beta$ indexed by the elements of $g$, and $w_g$ is a positive constant for each $g \in \mathcal{G}$.



Figure 1: Linear regression with $\ell_1$-norm regularization.

Left: $K = 1000$, $p = 20$; Right: $K = 100000$, $p = 200$.



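To be specific, in our numerical experiments, we set $p = 2^n$ for a positive integer $n$, fix the
weights $w_g$, and define the set of groups of inputs $\mathcal{G}$ as follows:

$$\mathcal{G} = \{g_{0,1}, g_{0,2}, \ldots, g_{0,2^n},\; g_{1,1}, g_{1,2}, \ldots, g_{1,2^{n-1}},\; \ldots,\; g_{n,1}\}, \tag{25}$$

where

$$g_{i,j} = \{(j-1)2^i + 1,\; (j-1)2^i + 2,\; \ldots,\; j2^i\} \quad \text{for } i = 0, 1, \ldots, n \text{ and } j = 1, 2, \ldots, 2^{n-i}. \tag{26}$$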

This particular type of overlapped group sparsity inducing norm is also called hierarchical norm [10]. 
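
The hierarchical group structure can be enumerated with a short Python sketch. It assumes the reading $g_{i,j} = \{(j-1)2^i + 1, \ldots, j2^i\}$ with levels $i = 0, 1, \ldots, n$ and $j = 1, \ldots, 2^{n-i}$, so that level 0 consists of singletons and level $n$ is the full index set; the function name `hierarchical_groups` is hypothetical.

```python
def hierarchical_groups(n):
    """List the groups g_{i,j} for p = 2^n, level by level, with 1-based indices.

    Level i splits {1, ..., 2^n} into 2^(n - i) consecutive blocks of size 2^i
    (assumed reading of the group definition).
    """
    groups = []
    for i in range(n + 1):
        size = 2 ** i
        for j in range(1, 2 ** (n - i) + 1):
            groups.append(list(range((j - 1) * size + 1, j * size + 1)))
    return groups
```

Under this reading there are $2^{n+1} - 1$ groups in total, and every index belongs to exactly $n + 1$ of them, one per level.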

We apply the algorithms SG, SSG and AC-SA to problem (22) with $\Omega(\beta) = \Omega_2(\beta)$. In
this case, the projection mappings in the SG and AC-SA algorithms no longer have a closed form
solution. Jenatton et al. [21] propose a coordinate descent method which can solve the projection
mappings within $|\mathcal{G}|$ iterations. We adopt their method as a subroutine for solving the
projection mappings when we apply SG and AC-SA to the hierarchical norm regularized regression
problem.

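In order to apply SSG, we need to reformulate the non-smooth term $\lambda\Omega_2(\beta)$ in the
form (2). Since the dual norm of the Euclidean norm is the Euclidean norm itself,
$\|\beta_g\| = \max_{\|\alpha_g\| \le 1}\alpha_g^T\beta_g$, where $\alpha_g \in \mathbb{R}^{|g|}$ is the
vector of auxiliary variables associated with $\beta_g$. Let
$\alpha = [\alpha_{g_1}^T, \ldots, \alpha_{g_{|\mathcal{G}|}}^T]^T$ be the vector of length
$\sum_{g \in \mathcal{G}}|g|$ and denote the domain of $\alpha$ by
$Q = \{\alpha \mid \|\alpha_g\| \le 1, \forall g \in \mathcal{G}\}$. Note that $Q$ is the Cartesian
product of unit balls in Euclidean space, which is a closed and convex set. We can rewrite
$\lambda\Omega_2(\beta)$ as:

$$\lambda\Omega_2(\beta) = \lambda\sum_{g \in \mathcal{G}} w_g\max_{\|\alpha_g\| \le 1}\alpha_g^T\beta_g = \max_{\alpha \in Q}\sum_{g \in \mathcal{G}}\lambda w_g\alpha_g^T\beta_g = \max_{\alpha \in Q}\alpha^T A\beta, \tag{27}$$

where $A \in \mathbb{R}^{\sum_{g \in \mathcal{G}}|g| \times p}$ is a matrix such that
$A\beta = [\lambda w_{g_1}\beta_{g_1}^T, \ldots, \lambda w_{g_{|\mathcal{G}|}}\beta_{g_{|\mathcal{G}|}}^T]^T$.
The rows of $A$ are indexed by all pairs $(i, g)$ such that $i \in \{1, \ldots, p\}$, $i \in g$, and its
columns are indexed by $j \in \{1, \ldots, p\}$; $A$ is defined as:

$$A_{(i,g),j} = \begin{cases} \lambda w_g & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases} \tag{28}$$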



Different from SG and AC-SA, the projection mapping in SSG always has a closed form solution.

We generate a dataset $\{(x_i, y_i)\}_{i=1}^K$ in the same way as before with $K = 1000$ and $n = 5$
($p = 2^n = 32$) and set the parameter $\lambda = 0.1$, the total number of iterations $N = 10000$ and
the sample size $|S| = 10$. The numerical results of these algorithms on this data set are shown in
the left panel of Figure 2. Also, we generate a larger data set with $K = 100000$ and $n = 9$
($p = 2^n = 512$) and run the algorithms on it with $\lambda = 0.1$, $N = 10000$ and $|S| = 100$. The
numerical results are shown in the right panel of Figure 2.

From Figure 1 and Figure 2, we can infer that even though SG, SSG and AC-SA have the same
$O(1/\sqrt{N})$ theoretical convergence rate, their performances are different in practical
applications. In three out of the four experiments, SG and SSG are more efficient than AC-SA. One of
the reasons for this result is that the AC-SA algorithm chooses a very small step length (see the
choice of the step-length parameter in Proposition 4 in [6]) in order to mitigate the impact of the
inaccuracy of the stochastic gradient. This is needed in the theoretical proof of the convergence
rate of AC-SA. However, when the stochastic gradient is a good approximation of the exact gradient,
e.g., when $\sigma^2$ is very small, the small step length in AC-SA is too conservative. Instead, SG
and SSG adopt a relatively larger step length such that they can reduce the objective functions more
efficiently.

The influence of the smoothing technique by Nesterov [19] is also reflected in these numerical
results. When $\Omega_1(\beta)$ is chosen as the regularization term, the projection mapping in SG has a
closed-form solution, so applying the smoothing technique is not necessary. Hence, in Figure 1,
the blue curve (SG) and the green curve (SSG) almost overlap. This is consistent with the fact that SG
and SSG have the same $O(1/\sqrt{N})$ complexity, as shown by Theorem 1 and Theorem 2.

However, in Figure 2, we can see that SSG is much more efficient than SG. This is because SG
has to use a coordinate descent method to solve the projection mapping due to the lack of a closed-form
solution when $\Omega_2(\beta)$ is the regularization term. Even though the coordinate descent method is
shown to converge after finitely many steps, it is still not necessarily as fast as evaluating the
closed-form solution that is available in SSG.



4.2 Regularized Linear Regression with Continuous Probability Distribution 

Here, we apply our algorithms again to regularized linear regression problems. However, this
time, we assume that there are infinitely many data points $(x, y) \in \mathbb{R}^{p+1}$ which follow a continuous
distribution $p(x, y)$. The task is still to find the parameters $\beta \in \mathbb{R}^p$ to fit the linear model
$y = \beta^T x + \epsilon$ by minimizing the average square loss function

$$f_1^c(\beta) = \frac{1}{2} E(x^T \beta - y)^2 = \frac{1}{2} \int (x^T \beta - y)^2 p(x, y)\,dx\,dy.$$

In our numerical experiment, we let $x$ follow the standard normal distribution in $\mathbb{R}^p$, i.e.,
$p(x) = N(0, I)$. The real parameters $\bar\beta$ are chosen to be $[1, 1, \ldots, 1, 0, 0, \ldots, 0]^T$ with the first $p/2$
components equal to 1 and the last $p/2$ components equal to 0, and the error term $\epsilon$ in the linear model is
assumed to have a standard normal distribution $N(0, 1)$, so that the variable $y$ follows a normal distribution
$p(y|x) = N(x^T \bar\beta, 1)$ once $x$ is fixed. With these settings, the distribution $p(x, y)$ in our numerical
experiments is

$$p(x, y) = p(x)\,p(y|x) = \frac{1}{(2\pi)^{p/2}} e^{-\frac{1}{2} x^T x} \cdot \frac{1}{(2\pi)^{1/2}} e^{-\frac{1}{2}(y - x^T \bar\beta)^2}.$$

Figure 3: linear regression with $\ell_1$-norm regularization and $K = +\infty$, $p = 1000$

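It is easy to show that, in this case, the loss function $f_1^c(\beta)$ becomes

$$f_1^c(\beta) = \frac{1}{2}\left(\beta^T \beta - 2\beta^T \bar\beta + \bar\beta^T \bar\beta + 1\right),$$

whose gradient is simply $\nabla f_1^c(\beta) = \beta - \bar\beta$ with a Lipschitz constant $L = 1$.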
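For completeness, the closed form can be verified in a few lines. The derivation below writes $\bar\beta$ for the real parameter vector and uses only the stated model: $y = x^T \bar\beta + \epsilon$ with $x \sim N(0, I)$ and $\epsilon \sim N(0, 1)$ independent of $x$, so that $E[x x^T] = I$, $E[\epsilon x] = 0$ and $E[\epsilon^2] = 1$.

```latex
\begin{aligned}
f_1^c(\beta) &= \tfrac{1}{2}\, E\big(x^T \beta - y\big)^2
  = \tfrac{1}{2}\, E\big(x^T(\beta - \bar\beta) - \epsilon\big)^2 \\
&= \tfrac{1}{2}\Big[(\beta - \bar\beta)^T E[x x^T] (\beta - \bar\beta)
  - 2\, E[\epsilon\, x^T](\beta - \bar\beta) + E[\epsilon^2]\Big] \\
&= \tfrac{1}{2}\big(\|\beta - \bar\beta\|^2 + 1\big)
  = \tfrac{1}{2}\big(\beta^T \beta - 2\beta^T \bar\beta + \bar\beta^T \bar\beta + 1\big).
\end{aligned}
```

Differentiating gives the gradient $\beta - \bar\beta$, and since the Hessian is the identity matrix, the gradient is Lipschitz continuous with constant 1.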

Similar to the discrete cases, we apply our algorithms to the following regularized linear regression
problem

$$\min_{\beta} f_1^c(\beta) + \lambda\Omega(\beta), \qquad (29)$$

where $\Omega(\beta)$ is the regularization term.

In order to generate a stochastic approximation for $\nabla f_1^c(\beta)$, in each iteration, we sample a set of
points $S = \{(x_i, y_i)\}_{i=1,\ldots,|S|}$ by generating $x_i$ from $N(0, I)$ and $\epsilon_i$ from $N(0, 1)$ and setting
$y_i = x_i^T \bar\beta + \epsilon_i$ for $i = 1, \ldots, |S|$. Then we can compute a stochastic gradient $G(\beta, S)$ by (21).
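The sampling step can be sketched as follows, writing `beta_bar` for the real parameter vector. The form of the mini-batch gradient used here, the average of $(x_i^T \beta - y_i)\,x_i$ over $S$, is an assumption about what (21) computes for the square loss; the function name, seed and sizes are likewise illustrative.

```python
import numpy as np

def sample_stochastic_gradient(beta, beta_bar, batch_size, rng):
    """Draw a fresh mini-batch from the continuous model and return the averaged square-loss gradient."""
    p = beta.shape[0]
    x = rng.standard_normal((batch_size, p))      # rows x_i ~ N(0, I)
    eps = rng.standard_normal(batch_size)         # eps_i ~ N(0, 1)
    y = x @ beta_bar + eps                        # y_i = x_i^T beta_bar + eps_i
    residual = x @ beta - y
    return x.T @ residual / batch_size

rng = np.random.default_rng(0)
p = 10
beta_bar = np.concatenate([np.ones(p // 2), np.zeros(p // 2)])
beta = np.zeros(p)

g_small = sample_stochastic_gradient(beta, beta_bar, 10, rng)
g_large = sample_stochastic_gradient(beta, beta_bar, 200_000, rng)
exact = beta - beta_bar                           # gradient of the closed-form loss
```

With a small batch the estimate is noisy; as the batch grows it approaches the exact gradient $\beta - \bar\beta$, which is the sense in which the stochastic gradient approximates the true one.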

First, we apply the AC-SA, SG and SSG algorithms to (29) with $\Omega(\beta) = \Omega_1(\beta)$, $p = 1000$, $|S| = 10$
and $\lambda = 0.1$. The performances of these algorithms are shown in Figure 3.

Then, we apply these three algorithms to (29) with $\Omega(\beta) = \Omega_2(\beta)$, $n = 8$ ($p = 2^n = 256$),
$|S| = 100$ and $\lambda = 0.1$. The numerical performances are presented in Figure 4.

Figure 3 and Figure 4 show phenomena similar to those in Figure 1 and Figure 2. SG and SSG
converge faster than AC-SA, and when the regularization term is complicated, SSG significantly
outperforms the other two algorithms.

4.3 Regularized Logistic Regression 

Figure 4: linear regression with hierarchical norm regularization and $K = +\infty$, $p = 256$

Suppose there are $K$ data points $\{(x_i, y_i)\}_{i=1}^K$, where each $x_i \in \mathbb{R}^p$ is the predictor with
Euclidean norm $\|x_i\| = 1$ and $y_i \in \{0, 1\}$ is the class label of $x_i$, which indicates that $x_i$ belongs to
class 0 or class 1. We assume that the posterior probability of the class label of a particular predictor
$x$ is given by

$$\Pr(y = 1 \mid x) = \frac{e^{\beta^T x}}{1 + e^{\beta^T x}}, \qquad (30)$$
$$\Pr(y = 0 \mid x) = \frac{1}{1 + e^{\beta^T x}}, \qquad (31)$$

for some $\beta \in \mathbb{R}^p$. The task of logistic regression is to find the parameters $\beta \in \mathbb{R}^p$ by minimizing
the negative log-likelihood corresponding to this set of data points, which is defined as

$$f_2(\beta) = -\frac{1}{K} \sum_{i=1}^K \log\left(\Pr(y_i \mid x_i)\right) = \frac{1}{K} \sum_{i=1}^K \left[\log\left(1 + e^{\beta^T x_i}\right) - y_i \beta^T x_i\right].$$
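The two expressions for the negative log-likelihood can be compared numerically. The sketch below evaluates the loss as the average of $\log(1 + e^{\beta^T x_i}) - y_i \beta^T x_i$ and checks it against a direct evaluation through the class probabilities; the toy data and the function name `logistic_loss` are assumptions for illustration only.

```python
import numpy as np

def logistic_loss(beta, X, y):
    """Average negative log-likelihood: mean of log(1 + exp(x_i^T beta)) - y_i * x_i^T beta."""
    z = X @ beta
    # np.logaddexp(0, z) evaluates log(1 + exp(z)) without overflow.
    return np.mean(np.logaddexp(0.0, z) - y * z)

# Hypothetical toy data used only to compare the two expressions.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
y = rng.integers(0, 2, size=50)
beta = np.array([0.5, -1.0, 2.0])

p1 = 1.0 / (1.0 + np.exp(-(X @ beta)))           # Pr(y = 1 | x), i.e. e^z / (1 + e^z)
likelihood = np.where(y == 1, p1, 1.0 - p1)      # Pr(y_i | x_i)
direct = -np.mean(np.log(likelihood))
closed = logistic_loss(beta, X, y)
```

Using `np.logaddexp` rather than `np.log(1 + np.exp(z))` is a numerical-stability choice, not something the text prescribes.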

Similar to the regularized linear regression problem, we minimize $f_2(\beta)$ together with a regularization
term $\Omega(\beta)$ in order to obtain a sparse solution $\beta$. Hence, the regularized logistic regression can
be formulated as

$$\min_{\beta} f_2(\beta) + \lambda\Omega(\beta), \qquad (32)$$

where $\Omega(\beta)$ can also be chosen to be $\Omega_1(\beta)$ or the hierarchical norm $\Omega_2(\beta)$ defined by (24), (25)
and (26).

The gradient of $f_2(\beta)$ is the following:

$$\nabla f_2(\beta) = \frac{1}{K} \sum_{i=1}^K \left(\frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} - y_i\right) x_i.$$

Because each data point satisfies $\|x_i\| = 1$, it can be shown that $\nabla f_2(\beta)$ is Lipschitz continuous
with a Lipschitz constant $L = 1$. Similar to the regularized linear regression problems, we randomly
sample a subset $\{(x_i, y_i)\}_{i \in S}$ with $S \subset \{1, 2, \ldots, K\}$ from the whole data set and generate the
stochastic gradient $G_2(\beta, S)$ for $f_2(\beta)$ as follows:

$$G_2(\beta, S) = \frac{1}{|S|} \sum_{i \in S} \left(\frac{e^{\beta^T x_i}}{1 + e^{\beta^T x_i}} - y_i\right) x_i. \qquad (33)$$
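A minimal sketch of the mini-batch gradient in (33) follows. Sampling $S$ without replacement, the toy data, and the function names are assumptions for illustration; the finite-difference comparison is included only as an independent check that the full-batch case matches the gradient of the logistic loss.

```python
import numpy as np

def logistic_stochastic_gradient(beta, X, y, idx):
    """Mini-batch gradient: average over i in idx of (sigmoid(x_i^T beta) - y_i) * x_i."""
    Xs, ys = X[idx], y[idx]
    probs = 1.0 / (1.0 + np.exp(-(Xs @ beta)))
    return Xs.T @ (probs - ys) / len(idx)

rng = np.random.default_rng(2)
K, p = 200, 5
X = rng.standard_normal((K, p))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # enforce ||x_i|| = 1
y = rng.integers(0, 2, size=K).astype(float)
beta = rng.standard_normal(p)

S = rng.choice(K, size=10, replace=False)        # random subset S of {0, ..., K-1}
g_batch = logistic_stochastic_gradient(beta, X, y, S)
g_full = logistic_stochastic_gradient(beta, X, y, np.arange(K))

def f2(b):
    """Average negative log-likelihood over the whole toy data set."""
    z = X @ b
    return np.mean(np.logaddexp(0.0, z) - y * z)

# Central finite differences of f2 as an independent check on the full gradient.
h = 1e-6
g_fd = np.array([(f2(beta + h * e) - f2(beta - h * e)) / (2 * h) for e in np.eye(p)])
```

Taking `idx` to be all indices recovers the exact gradient, so the same function serves both the stochastic and the deterministic case.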

Now we apply SG, SSG and AC-SA to problem (32) with $\Omega(\beta) = \Omega_1(\beta)$. We create a set of
artificial data $\{(x_i, y_i)\}_{i=1}^K$ with $K = 1000$ and $p = 20$ as follows. First, we choose the real
parameter $\bar\beta$ to be an all-ones vector in $\mathbb{R}^p$. After that, for each $i = 1, \ldots, K$, we create $\tilde{x}_i$ by
generating each of its components $\tilde{x}_{ij}$ from a standard normal distribution $N(0, 1)$ independently, and
we get $x_i$ by normalizing $\tilde{x}_i$, i.e., $x_i = \tilde{x}_i / \|\tilde{x}_i\|$. The corresponding $y_i$ is set to be 1 or 0 randomly
with the probabilities defined by (30). We set the sample size $|S| = 10$, $\lambda = 0.01$ and the number
of iterations $N = 50000$ in all three algorithms. The numerical performances are presented
in Figure 5.
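The data-generation recipe above can be sketched as follows. The function name `make_logistic_data` and the seed are illustrative assumptions; the steps themselves (all-ones real parameter, Gaussian components, normalization to unit norm, Bernoulli labels with the probability in (30)) follow the description in the text.

```python
import numpy as np

def make_logistic_data(K, p, rng):
    """Generate unit-norm predictors and Bernoulli labels following the recipe in the text."""
    beta_bar = np.ones(p)                                    # real parameter: all-ones vector
    x_tilde = rng.standard_normal((K, p))                    # components drawn from N(0, 1)
    X = x_tilde / np.linalg.norm(x_tilde, axis=1, keepdims=True)
    prob_one = 1.0 / (1.0 + np.exp(-(X @ beta_bar)))         # Pr(y = 1 | x) as in (30)
    y = (rng.random(K) < prob_one).astype(int)               # y_i = 1 with probability prob_one
    return X, y, beta_bar

X, y, beta_bar = make_logistic_data(K=1000, p=20, rng=np.random.default_rng(3))
```

Normalizing each row guarantees $\|x_i\| = 1$, which is the property the Lipschitz bound on the gradient relies on.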

For problem (32) with $\Omega(\beta) = \Omega_2(\beta)$, we generate the data in the same way as above but with
$K = 1000$ and $n = 5$ ($p = 32$). Again, $|S|$, $\lambda$ and $N$ are set to 10, 0.01 and 50000, respectively.
Figure 6 shows how the objective values decrease for these algorithms.



Figure 6: logistic regression with hierarchical norm regularization and $K = 1000$ (curves: AC-SA, SG, SSG; horizontal axis: CPU time in seconds)

The properties of our algorithms shown in Figure 5 and Figure 6 are very similar to those
shown in Figures 1, 2, 3 and 4. In the $\ell_1$-norm regularized logistic regression problems, SG and SSG are
more efficient than AC-SA due to their more aggressive choices of step length. In the case
of hierarchical norm regularized logistic regression, SSG is more efficient than the other two
because SSG has a closed-form solution for its projection mapping, whereas SG and AC-SA have to rely
on another algorithm as a subroutine to solve their projection mappings.



5 Conclusion 

In this paper, we consider an optimization problem whose objective function is the sum of a
smooth convex function and a non-smooth convex function. We first develop a stochastic gradient
descent algorithm for solving this problem. We also propose another stochastic gradient descent
algorithm by smoothing the non-smooth term in the objective function. The convergence rates of
these two algorithms are proved. The results of our numerical experiments demonstrate the efficiency
and scalability of our algorithms.